Abstract: We present a method of estimating the number of
people in high density crowds from still images. The method 
estimates counts by fusing information from multiple sources. 
Most of the existing work on crowd counting deals with very small
crowds (tens of individuals) and uses temporal information from
videos. Our method uses only still images to estimate the counts
in high density images (hundreds to thousands of individuals). At
this scale, we cannot rely on only one set of features for count
estimation. We therefore use multiple sources, viz. interest points
(SIFT), Fourier analysis, wavelet decomposition, GLCM features 
and low confidence head detections, to estimate the counts. Each 
of these sources gives a separate estimate of the count along 
with confidences and other statistical measures which are then 
combined to obtain the final estimate. We test our method on an
existing dataset of fifty images containing nearly 64000 individuals.
Further, we added another fifty annotated images of crowds and
tested on the complete dataset of one hundred images containing
over 87000 individuals. The counts per image range from 81
to 4633. We report the performance in terms of mean absolute 
error, which is a measure of accuracy of the method, and mean 
normalised absolute error, which is a measure of the robustness. 


I. Introduction 

Crowd counting is a fundamental part of crowd management.
It has several real-world applications, such as safety control,
urban planning, monitoring crowds for surveillance, and modelling
crowds for animation and crowd simulation. Crowd size may also be
an indicator of comfort level in public spaces or of an imminent
stampede.

Many automated systems for density and count estimation
have been proposed. However, many of these techniques suffer
from limitations: inability to handle large crowds (hundreds
or thousands of people); reliance on temporal constraints
in crowd videos; and reliance on detecting, tracking and analysing
individual persons in crowds. Another important limitation
of some of these methods is the requirement
of installed infrastructure on the site.

Most existing people counting methods can be divided into
three categories: (1) pixel-based analysis; (2) texture-based
analysis; and (3) object-level analysis. Pixel-based methods
employ very local features such as edge information or individual
pixel analysis to obtain counts. Since they use
very local features, these methods are mostly focussed on
density estimation rather than the count. Texture-based methods
rely on texture modelling through the analysis of image
patches. Some texture analysis methods that
have been suggested include grey-level co-occurrence matrix,
Fourier analysis and fractal dimension. Object-level analysis
methods try to locate individual persons in a scene,
but these methods work best only for very low density
crowds. For high density crowds the evidence for the presence
of a single person is scarce. Even for low density crowds,
partial occlusions, variations in clothing and pose, perspective
effects, cluttered background and other complexities negatively
impact the performance of object-level methods.

Fig. 1. Examples of high density crowds. On average, each image in our
crowd counting dataset contains 871 people with a minimum of 81 and
maximum of 4633.

Brostow and Cipolla, and Rabaud and Belongie,
count moving people in videos by estimating contiguous
regions of coherent motion. Addressing concerns about preservation
of privacy in tracking people for counting, Chen et
al. propose a texture-based method for counting. They use
a mixture of dynamic textures to segment the crowd into
components of homogeneous motion and then use a set of
simple holistic features to find the correspondence between
features and the number of individuals per segment.

Some works estimate the relationship between low-level
features and the density or count by training regression models.
Some of these methods are global: they learn a single
regression function for the entire image or video. But
these methods make an implicit assumption that the density
is the same over the image, which is not valid for most
images due to perspective effects, viewpoint changes, etc. Other
regression methods are local: they divide the image into
cells and perform regression for each cell. These
methods deal efficiently with the problems associated with global
regression. An alternative multi-output regression
model was proposed by Chen et al.

The aim of this paper is to develop an effective texture-based
method to solve the problem of counting the number of people
in extremely dense crowds. Our goal is to arrive at a method
that works well for dense crowds but at the same time is robust
to variations in density. For very dense crowds, a single feature
or detection method alone cannot provide an accurate count
due to low resolutions, occlusions, foreshortening and perspective.
We build upon the work of Idrees et al. [14] and propose
a model that combines sources of complementary information
extracted from the images. Dense crowds can be thought of as
a texture and this texture corresponds to a harmonic pattern at
fine scales. Texture analysis methods have shown promising
results for crowd counting/density estimation.

Appearance based features like SIFT descriptors are also 
useful to estimate the texture elements. SIFT features have 
been shown to be successful for crowd detection in [9]. Idrees
et al. also used SIFT as one of the features for estimating 
crowd counts. 

The main contribution of this work is the use of multiple 
texture analysis sources to estimate the counts for dense 
crowds. We employ Fourier analysis, GLCM features and 
wavelet transform to analyse the texture information. As far 
as we know, wavelet features have not been used for crowd 
counting or density estimation before. Along with these, we 
use head-detections and SIFT descriptors for our framework. 
The combination of multiple information sources provides 
robustness and accuracy to the counting process. 

Existing methods suffer from severe scalability issues. Most
existing methods have been tested on low to medium density
crowds, e.g., the UCSD dataset (11 - 46 people per frame), the
Mall dataset (13 - 53 people per frame) and the PETS
dataset [22] (3 - 40 people per frame). In contrast, we show
the performance of our method on the UCF crowd counting
dataset [14] of 50 images containing between 96 and 4633
people per image. Further, we complement the dataset with 50
more images to expand it to 100 images and demonstrate the
robustness and accuracy of our method on this new combined
dataset.

The remainder of the paper is organised as follows. We
present our methodology in Section II, provide experimental
validation and results in Section III, and finally discuss and
draw some conclusions in Section IV.

II. Methodology 

Our goal is to provide an estimate of the total person count 
in an image. Figure 2 gives an overview of our framework. We
first partition the image into small cells in a grid. We obtain 
the count estimate from each cell to counter the variations in 
density of the crowd over the image. The final output is the 
sum of all cell counts. 
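
The grid partition and the final summation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the list-of-rows image representation and the `estimate_cell_count` callback (standing in for the per-cell estimator described in the next subsection) are assumptions made for the example, and the cell size is left as a free parameter because the text does not state it.

```python
def split_into_cells(image, cell_size):
    """Yield (top, left, cell) for each cell of a grid laid over `image`.

    `image` is a list of pixel rows; border cells may be smaller than
    `cell_size` when the image size is not a multiple of it.
    """
    height, width = len(image), len(image[0])
    for top in range(0, height, cell_size):
        for left in range(0, width, cell_size):
            cell = [row[left:left + cell_size]
                    for row in image[top:top + cell_size]]
            yield top, left, cell


def count_image(image, cell_size, estimate_cell_count):
    """Sum the per-cell estimates to obtain the image-level count."""
    return sum(estimate_cell_count(cell)
               for _, _, cell in split_into_cells(image, cell_size))


# Toy usage: a stand-in "estimator" that just counts pixels set to 1.
toy_image = [[1, 0, 0, 1],
             [0, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 1]]
toy_total = count_image(toy_image, 2, lambda cell: sum(map(sum, cell)))
```

Because each cell is handled independently, the per-cell estimator can be swapped out without touching the summation.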

A. Counting in a cell 

For a given cell P, we estimate the counts and confidences 
from five different sources. These are combined to obtain a 
final estimate of the person count for that cell. (Note: In this 
section we will use ‘cell’/ ‘patch’ to mean the same thing.) 

Fig. 2. A flow chart illustrating the methodology adopted in this paper. (a)
The image is first divided into small cells in a grid. (b) The count is estimated
for each cell by fusing estimates from five different methods. The final count
estimate for the image is the sum of all cell counts.

1) Interest-points based count: Arandjelovic [3] used a
statistical model based on quantised SIFT features to segment
an image into crowd and non-crowd regions. Subsequently,
Idrees et al. [14] used interest points to estimate counts and to
get a confidence score of whether the cell represents a crowd.
We follow this idea to calculate the counts. Given a training
set, we obtain SIFT features and cluster them into a codebook
of size K. We use the sparse SIFT features to train a Support
Vector Regression model using the counts at each patch from
ground truth and then use the trained model to obtain counts
for new image patches. We calculate the SIFT features using
the VLFeat library [23].
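
The quantisation step can be illustrated with a short sketch: each descriptor in a cell is assigned to its nearest codeword and the per-word occurrences are counted. This is an assumption-laden toy, not the paper's pipeline: the function names are hypothetical, the codebook is taken as given (the clustering algorithm is not specified in the text), and the regression model that would consume the histogram is omitted.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword closest to `descriptor` (squared Euclidean)."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def word_histogram(descriptors, codebook):
    """Count how often each of the K codewords occurs in one cell."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1
    return hist


# Toy 2-D "descriptors" (real SIFT descriptors are 128-dimensional).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_descriptors = [(0.1, -0.2), (0.9, 1.2), (4.0, 6.0), (1.1, 0.8)]
toy_hist = word_histogram(toy_descriptors, toy_codebook)
```

The resulting histogram holds the per-word counts for one cell; it is mostly zeros when K is large, which is the sparsity the text refers to.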

Due to the sparse nature of SIFT features, the probability of
observing $k_i$ instances of the $i$-th SIFT word can be modelled
as a Poisson distribution. Suppose, for a cell containing crowd,
the expected number of detections of the $i$-th SIFT word is
$\lambda_i^{+}$. Then:

$$p(k_i \mid \text{crowd}) = \frac{e^{-\lambda_i^{+}} (\lambda_i^{+})^{k_i}}{k_i!} \qquad (1)$$

Similarly, for a non-crowd cell (with expected number $\lambda_i^{-}$):

$$p(k_i \mid \neg\text{crowd}) = \frac{e^{-\lambda_i^{-}} (\lambda_i^{-})^{k_i}}{k_i!} \qquad (2)$$

Assuming independence between counts of any two SIFT
words in a cell:

$$p(k_i, k_j \mid \text{crowd}) = p(k_i \mid \text{crowd})\, p(k_j \mid \text{crowd}) \qquad (3)$$

The log of the likelihood ratio of crowd and non-crowd
patches is:

$$\mu = \log p(k_1, k_2, \ldots, k_K \mid \text{crowd}) - \log p(k_1, k_2, \ldots, k_K \mid \neg\text{crowd})
     = \sum_{i=1}^{K} \left[ \lambda_i^{-} - \lambda_i^{+} + k_i \left( \log \lambda_i^{+} - \log \lambda_i^{-} \right) \right] \qquad (4)$$

$\mu$ can be interpreted as the confidence about the presence
of crowd in an image cell.

Fig. 3. Local maxima (red dots) for some images obtained by Fourier analysis.
(a), (b) We notice that the maxima correspond very well with heads in dense
crowds. (c), (d) However, Fourier analysis is crowd blind and cannot detect
the actual presence of a crowd.
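
As a small worked illustration of Eq. (4), the log-likelihood ratio can be computed directly from the per-word counts and the two sets of Poisson rates. The function and argument names are hypothetical, and the rates are assumed to have been estimated beforehand from crowd and non-crowd training cells (the text does not describe that step).

```python
import math


def crowd_log_likelihood_ratio(word_counts, lam_crowd, lam_noncrowd):
    """Log-likelihood ratio of Eq. (4) for one cell.

    word_counts[i]  : occurrences of the i-th SIFT word in the cell
    lam_crowd[i]    : expected count of word i in crowd cells (> 0)
    lam_noncrowd[i] : expected count of word i in non-crowd cells (> 0)
    """
    return sum(
        lam_neg - lam_pos + k * (math.log(lam_pos) - math.log(lam_neg))
        for k, lam_pos, lam_neg in zip(word_counts, lam_crowd, lam_noncrowd)
    )


# Toy example with K = 2 words; the second word is uninformative
# because its rate is the same in both classes.
toy_ratio = crowd_log_likelihood_ratio([3, 0], [2.0, 1.0], [1.0, 1.0])
```

The factorial terms of Eqs. (1) and (2) cancel in the ratio, which is why they do not appear in the code. A positive value favours the crowd hypothesis and a negative value the non-crowd one.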


2) Counts from texture analysis methods: Crowds are
repetitive in nature since all humans appear similar from a
distance. We employ three different texture analysis methods,
each of which gives a separate estimate of the count; these
estimates are used later to form the final estimate of the count
in the cell.

• Fourier analysis 

Fourier analysis can be used to capture the repetitions in
crowds. Since we are dealing with small cells and not the
complete image, we can safely assume that the crowd
density in a cell is uniform. In this case, the Fourier
transform, $f(\xi)$, will show the repeated occurrence of
people as peaks.

For a given cell, $P$, in an image, we calculate the gradient,
$\nabla(P)$, and apply a low-pass filter to remove high
frequency components. Then we apply the inverse Fourier
transform to obtain the reconstructed image patch, $P_r$.
The local maxima in the reconstructed image give an
estimate of the total person count in that cell. Figure 3
shows some images with the local maxima obtained by
this method marked. We also calculate several other
statistical measures, such as entropy, mean, variance,
skewness and kurtosis, for both $P_r$ and the difference
$|\nabla(P) - P_r|$. We use the count and these measures as
input for the next step (section II-A4).


• GLCM features 

Marana et al. used texture features based on the grey-level
co-occurrence matrix (GLCM) to classify image
patches into categories based on density. Others
have also used GLCM features for density/count
estimation [24]. We adopt similar features to estimate
the number of people. We quantise the image and
calculate the joint-conditional probability density function,
$f(i, j \mid d, \theta)$, with distance $d = 1$ and angles
$\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$. We calculate
the following texture features:

Dissimilarity: $D(d, \theta) = \sum_{i,j} f(i, j \mid d, \theta)\, |i - j|$
Homogeneity: $H(d, \theta) = \sum_{i,j} \frac{f(i, j \mid d, \theta)}{1 + (i - j)^2}$
Energy: $E(d, \theta) = \sum_{i,j} f(i, j \mid d, \theta)^2$
Entropy: $P(d, \theta) = -\sum_{i,j} f(i, j \mid d, \theta) \log f(i, j \mid d, \theta)$

So we obtain 16 features (four for each $\theta$) for each image cell.
We then train a support vector regression model using these features and
ground truths from cells from the training images. We
pass the count estimate and statistical measures such as
variance, skewness and kurtosis of the GLCM matrices
as features to the next stage.

• Wavelet decomposition 

We use the multi-resolution properties of the two-dimensional
wavelet transform to extract features for the
counting framework.

Given a cell, $P$, we calculate the three-step pyramid-structured
wavelet transform and obtain the 10
lower resolution sub-images. We then calculate the
energy contained in each of them as
$e = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} |I(i, j)|$,
where $I$ is a sub-image with
resolution $M \times N$. Thus, we obtain a ten-dimensional
feature vector for each image cell. Texture energies
are distributed differently for different texture patterns,
so the energy features calculated above can be used
for discriminating crowds and estimating crowd counts.
We train a support vector regression model with these feature
vectors using the ground truth counts from the training
data as outputs. We also calculate statistical measures
such as variance, skewness and kurtosis of the 10 lower
resolution sub-images and pass these measures along
with the count to the next step.

3) Count from head detections: Detecting whole humans is not
possible in dense crowds due to severe occlusions. Only the
heads may be visible at this scale, so we estimate the count
by detecting heads in the image. We used a Deformable Part
Model [19] trained on the INRIA Person dataset and applied
only the filter corresponding to heads, with a low threshold
because heads are often very small and partially
occluded in such images.

We see that there are many false positives and negatives
in the detection results (Figure 4). However, the detector performs
considerably better for nearby/larger heads. Since the texture analysis
methods are crowd-blind and work well mostly for very dense
crowds, we need SIFT-based analysis and head detections to
add robustness to the system so that it can perform
accurately in relatively low-density environments too.

Each detection is accompanied by the scale and confidence
associated with it. For each cell we return the number of
detections, $\eta_{head}$, and the means and variances of the scales and
confidences, to be used in the next step.

Fig. 4. Some head detection results.

Fig. 5. Some annotated images from the dataset.
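
The per-cell summary of the head detections can be sketched as below. The tuple layout of a detection, the assignment of a detection to a cell by its centre, the use of population variance and the all-zero vector for empty cells are all assumptions made for this example; the text does not specify them.

```python
from statistics import mean, pvariance


def head_features(detections, cell_box):
    """Summarise head detections whose centre falls inside one cell.

    detections : iterable of (x, y, scale, confidence) tuples
    cell_box   : (left, top, right, bottom), right/bottom exclusive
    Returns [count, mean scale, var scale, mean confidence, var confidence].
    """
    left, top, right, bottom = cell_box
    inside = [(s, c) for x, y, s, c in detections
              if left <= x < right and top <= y < bottom]
    if not inside:
        return [0, 0.0, 0.0, 0.0, 0.0]
    scales = [s for s, _ in inside]
    confs = [c for _, c in inside]
    return [len(inside), mean(scales), pvariance(scales),
            mean(confs), pvariance(confs)]


# Toy usage: two of the three detections fall in the 10 x 10 cell at the origin.
toy_detections = [(5, 5, 1.0, 0.2), (8, 2, 3.0, 0.6), (20, 5, 2.0, 0.9)]
toy_head_features = head_features(toy_detections, (0, 0, 10, 10))
```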

4) Total count in the cell: We densely sample cells from
the training images and obtain counts and other features
from all the above methods. We then use the annotations
to train an $\epsilon$-SVR with the counts and other features from
above as inputs and the final estimate of the count as
output. This SVR combines the information obtained from
the five different sources to give an estimate of the patch count.
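
For reference, the standard $\epsilon$-insensitive support vector regression problem is recalled below. The notation is introduced here and is not the paper's: $\mathbf{x}_i$ stands for the fused feature vector of training cell $i$, $y_i$ for its annotated count, $\phi$ for the (unspecified) kernel feature map, and $C$ and $\epsilon$ for hyper-parameters whose values are not given in the text.

```latex
\min_{\mathbf{w},\, b,\, \xi,\, \xi^{*}} \quad
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right)
\\
\text{subject to} \quad
  y_i - \mathbf{w}^{\top}\phi(\mathbf{x}_i) - b \le \epsilon + \xi_i, \quad
  \mathbf{w}^{\top}\phi(\mathbf{x}_i) + b - y_i \le \epsilon + \xi_i^{*}, \quad
  \xi_i,\, \xi_i^{*} \ge 0 .
```

Errors smaller than $\epsilon$ incur no penalty, so the fitted function is not pulled towards every small fluctuation in the per-source estimates.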

The total person count of the image is finally obtained
by summing the counts obtained from all cells in the grid.
Here we are assuming that the cell counts are independent.
This is a reasonable assumption because we are dealing with
widely varying viewpoints, perspective effects and crowd
densities. We believe that imposing neighbourhood constraints,
as done in [14], limits the efficacy of multi-source count
estimation to images with mostly uniform densities.

III. Experiments 


A. Dataset 

We first use the publicly available UCF crowd counting 
dataset to compare our results to past work. This dataset 
contains 50 images. These images contain 96 to 4633 people
with an average of 1280 people per image. The authors of [14] 
provide the ground truth dot-annotations with the images, i.e., 
each person is marked with a dot. There are 63974 annotations 
in the 50 images. 

We then collected 50 more images from Flickr and added
annotations to extend the above dataset to 100 images. We
included a wide variety of viewpoints and perspectives in these
images, which the UCF dataset lacks, so that we could
estimate the robustness of the system.

The final dataset has 100 images with 87135 annotations: on
average, 871 individuals per image, with the number
varying from 81 to 4633.


B. Evaluation metrics 


We use absolute error (AE) and normalised absolute error 
(NAE) for evaluating the performance. We report the mean 
and deviations of both AE and NAE for both the UCF dataset 
and our extended dataset. 


$$\mu_{AE} = \frac{1}{N} \sum_{i=1}^{N} |\eta_i - \hat{\eta}_i| \qquad (5)$$

$$\mu_{NAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\eta_i - \hat{\eta}_i|}{\eta_i} \qquad (6)$$

where $\mu_{AE}$ and $\mu_{NAE}$ denote the mean of AE and NAE
respectively, $\hat{\eta}_i$ is the estimated count, $\eta_i$ is the actual ground
truth count, and $N$ is the number of cells/images.
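
Eqs. (5) and (6) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the use of the population standard deviation for the reported deviations is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def absolute_errors(estimated, ground_truth):
    """Per-item absolute error between estimate and ground truth."""
    return [abs(gt - est) for est, gt in zip(estimated, ground_truth)]


def normalised_absolute_errors(estimated, ground_truth):
    """Per-item AE divided by the ground truth (which must be non-zero)."""
    return [abs(gt - est) / gt for est, gt in zip(estimated, ground_truth)]


def summarise(errors):
    """Return (mean, standard deviation) of a list of errors."""
    return mean(errors), pstdev(errors)


# Toy usage with three images.
toy_estimates = [90, 230, 1000]
toy_truth = [100, 200, 1000]
toy_mean_ae, toy_std_ae = summarise(absolute_errors(toy_estimates, toy_truth))
toy_mean_nae, toy_std_nae = summarise(
    normalised_absolute_errors(toy_estimates, toy_truth))
```

In the toy data, the 30-person error on the 200-person image contributes more to the NAE than to the AE relative to its size. Small images can dominate the mean NAE in the same way.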

Also, since we are dealing with small cells, we report the 
per-cell performances too. 


C. Evaluation 

We randomly divided the UCF dataset into groups of 10
and ran 5-fold cross validation. We compare the performance
of our model with the models presented by Idrees et al. [14],
Rodriguez et al. [25] and Lempitsky et al.
in Table I. These methods are among the very few suited for
this problem because most of the other methods either rely on
videos or human detection, and cannot be used with the UCF
dataset.
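
The random split into folds can be sketched as follows. The function name and the fixed `seed` are assumptions for reproducibility of the example; the actual random partition used in the experiments is not given in the text.

```python
import random


def k_fold_splits(n_items, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each fold of a random partition."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    fold_size = n_items // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    # Leftover items (when n_items is not a multiple of n_folds) join the last fold.
    folds[-1].extend(indices[n_folds * fold_size:])
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield sorted(train), sorted(test)


# 50 images in 5 folds: each round trains on 40 images and tests on 10.
ucf_splits = list(k_fold_splits(50, 5))
```

Every image is used for testing exactly once, so the reported errors cover the whole dataset.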

The method presented by Rodriguez et al. [25] used head
detections for counting while Lempitsky et al. used SIFT
features to learn a regression function for counting. The authors
of [14] found that [25] performs best around counts of 1000,
but as we move away on either side, its error increases.
This is because the estimated counts are fairly steady across
the dataset and do not respond well to changes in crowd
density. It overestimates the counts in the low count images
and underestimates in the high count images, leading to high
absolute error in both cases. The method of Lempitsky et al.,
however, performs well at higher counts but poorly in terms of
NAE for lower counts. Its MESA-distance is designed to minimise
the maximum AE across images during training. Images with
higher counts tend to have higher AE, and thus the algorithm
focusses mainly on these images. The model gets biased
towards high density images but performs poorly for low
density ones. From Figure 6 we see that the proposed method
too performs poorly for some low-count images. However, the
method performs quite well in the middle and high count range
(>1000 individuals per image).

Method              AE               NAE
Rodriguez et al.    655.7 ± 697.8    0.706 ± 1.02
Lempitsky et al.    493.4 ± 487.1    0.612 ± 0.916
Proposed            514.1 ± 526.4    0.542 ± 0.484
Idrees et al.       419.5 ± 541.6    0.313 ± 0.271

TABLE I. Quantitative comparison of the proposed method
with Rodriguez et al. [25], Lempitsky et al., and Idrees et
al. [14] using the means and standard deviations of Absolute
Error (AE) and Normalised Absolute Error (NAE) for the UCF
crowd counting dataset. The proposed algorithm outperforms
[25] on both measures and Lempitsky et al. in terms of NAE, but is
outperformed by a much more computationally expensive model from [14].

             AE               NAE
Per-patch    9.5 ± 14.682     -
Per-image    377.7 ± 480.8    0.666 ± 1.123

TABLE II. Per-patch and per-image results for the complete
dataset of 100 images.

We observe that, unlike our method, the other three methods
perform poorly on the highest-density images. All of them
tend to underestimate the counts in that range. The first thing to
note is that most images in this category are very high-resolution,
so individuals are unlikely to have been missed
during annotation. Also, the per-cell density increases super-linearly
for this group, whereas it is linear for the other categories.
Since there are very few such images, they could have
been treated as outliers during training. Our method gives
mostly average or better performance on such images. It relies
heavily on the texture information from the images (Fourier
analysis, GLCM features, wavelet features) to estimate the
counts, so it performs well on the higher density crowds, since
the texture approximation is most applicable to such crowds.

We also evaluated the performance of our proposed method
on the extended dataset containing 100 images. This dataset
has much more diversity than the original dataset because its
crowds appear at varying densities and from various
viewpoints. This is useful for testing the robustness of the
algorithm. For this case, we divided the dataset into sets of
25 and ran 4-fold cross validation. The final per-patch and
per-image results are shown in Table II.

Table II gives the performance of the algorithm on the final
dataset in terms of mean absolute error and mean normalised
absolute error at both the patch level and image level. At the
image level, we obtain a mean absolute error of 377.7 with a
standard deviation of 480.8 and a mean normalised absolute error
of 0.666 with standard deviation 1.123. The reason for the
seemingly high NAE is evident from Figure 6: a very small number
of images in the low crowd-density category (below 500 individuals
per image) drive the mean NAE up. Our method does
not work well for some low density images. Figures 8
and 9 show the images with the lowest and highest absolute
errors respectively.

We observe that most of the images for which we get high
absolute errors are very high density crowds. These images
mostly contain extreme perspective variations. Also, some of
the images have lens distortions, which may be a reason for
poor estimates. We also note that the NAE is very high for
some of the images in the low density region. We believe that
texture methods do not perform very well for such images,
which is why our system also includes head detections and
interest-point analysis. Further research in this area could focus
on finding ways to pre-determine the density of crowds in
different image regions so that these methods (head detections
and interest-point analysis) could be given more importance
for those regions. We also note that only a few images
drive the average absolute error and NAE up. Removing
just the worst-performing 10% of images from the final dataset
and considering the remaining 90 images reduces the absolute error
to 256.3 ± 217.7 and the NAE to 0.407 ± 0.328.
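
The trimmed statistics can be reproduced from a list of per-image errors along the following lines. This is only a sketch: the function name is hypothetical, and ranking the images by the same error that is being summarised is an assumption, since the text does not say how the worst-performing images were selected.

```python
from statistics import mean, pstdev


def drop_worst(errors, fraction=0.10):
    """Return the errors with the largest `fraction` of values removed."""
    n_drop = int(round(len(errors) * fraction))
    return sorted(errors)[:len(errors) - n_drop]


# Toy usage: 100 synthetic per-image errors 1, 2, ..., 100.
toy_errors = list(range(1, 101))
toy_kept = drop_worst(toy_errors)
toy_trimmed_mean, toy_trimmed_std = mean(toy_kept), pstdev(toy_kept)
```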

Figure 7 shows the per-patch performance of the algorithm.
Black dots are the mean absolute errors per patch, red bars
represent the standard deviations, and blue diamonds are the
actual average number of individuals per patch. We observe
that, for higher density crowds, the mean absolute error per
patch increases with the actual count. The absolute error
per patch is almost constant, and very small, until around image
90, i.e., for images with counts less than about 2000. This
demonstrates the efficacy of the presented algorithm.

IV. Conclusion 

We considered a method for estimating the number of people
in extremely dense crowds from still images. The counting
problem at this scale has barely been tackled before. We presented
a method that uses information from multiple sources
to estimate the count in an image. We used head detections,
interest-point based counting and texture analysis methods
(Fourier analysis, GLCM features and wavelet analysis) as
the different sources of information. Each of these constituent
parts gives an independent estimate of the count, along with
confidences and other features, which are then fused to give
a final estimate. We presented the results of the tests and
experiments we performed. We also introduced a new dataset
of still images along with annotations which can complement
the existing UCF dataset. The results are very promising and,
since the model is extremely simple, it can be applied to real-time
counting in critical areas like pilgrimage sites and other
areas which present a danger of stampedes.

Fig. 6. Normalised Absolute Error (NAE) vs. the ground truth counts for (a)
the UCF 50 images; and (b) the complete dataset.

Fig. 7. Analysis of patch estimates in terms of absolute error per patch. The
image numbers have been sorted with respect to the actual counts. Black dots
are the mean absolute errors, red bars represent the standard deviations and
blue diamonds are the ground truths.

Fig. 8. Estimated counts for some images with the lowest absolute errors
(AE).

Fig. 9. Estimated counts for some images with the highest absolute errors
(AE).